Earthquakes are an integral part of our planet’s geology. Understanding the frequency and strength of seismic activity is crucial, because this information is essential for studying what causes damaging earthquakes and how their damage can be prevented. Fortunately, the United States Geological Survey (USGS) records comprehensive data on the magnitude and location of earthquakes across the United States and its surrounding areas.
earthquake <- read.csv("C:/Users/buson/OneDrive/Desktop/r_project/project_folder/data/usgs_main.csv")
earthquake <- earthquake[!is.na(earthquake$mag),]
Since the time column is not very useful as a whole, we extract the month and the day from each record, creating two new variables. We then build a new data set with the command “data.frame()”, called “df_earthquake”, and work on it from this point on.
##   months days latitude  longitude depth  mag  rms
## 1     03   04 38.75967 -122.71967  1.61 1.24 0.04
## 2     03   04 38.83383 -122.81550  1.82 1.13 0.02
## 3     03   04 35.59667 -120.27133 11.57 2.31 0.01
## 4     03   04 35.92917 -117.66083  3.25 0.88 0.13
## 5     03   04 62.36020 -149.63450  9.80 1.40 0.52
## 6     03   04 17.96133  -66.84883 13.23 2.37 0.14
##                                    place       type
## 1         3km SW of Anderson Springs, CA earthquake
## 2              8km NW of The Geysers, CA earthquake
## 3                 11km SE of Shandon, CA earthquake
## 4              22km E of Little Lake, CA earthquake
## 5     24 km NNE of Susitna North, Alaska earthquake
## 6 4 km ESE of Maria Antonia, Puerto Rico earthquake
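The code that builds “df_earthquake” is not shown in this section. As a minimal sketch of the extraction described above, assuming the time column holds ISO 8601 strings (year-month-day followed by the time of day), the month and the day can be cut out by character position. The data frame “earthquake_demo” and its two timestamps are hypothetical stand-ins, not rows from the USGS file.

```r
# Hypothetical stand-in for the USGS export; timestamps are made up for illustration.
earthquake_demo <- data.frame(
  time = c("2022-03-04T23:58:12.340Z", "2022-10-17T05:03:41.120Z"),
  mag  = c(1.24, 2.31),
  stringsAsFactors = FALSE
)

# In an ISO 8601 timestamp, characters 6-7 are the month and 9-10 the day.
df_demo <- data.frame(
  months = substr(earthquake_demo$time, 6, 7),
  days   = substr(earthquake_demo$time, 9, 10),
  mag    = earthquake_demo$mag,
  stringsAsFactors = FALSE
)

head(df_demo)
```

Keeping the month and day as two-character strings preserves the leading zero seen in the output above; they can be converted with “as.numeric()” when a number is needed.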
First of all, we are going to use a subset of the “df_earthquake” data set. The main reason is its size: more than 75 thousand records can be difficult to handle. To avoid the problem, we use the library “dplyr”, in particular the function “sample_n()”, to randomly extract 2500 records.
library(dplyr)
set.seed(1)
df <- sample_n(tbl = df_earthquake, size = 2500)
In this first part, our focus is the monthly distribution of records. We therefore create a new data set using the command “data.frame()” in combination with the function “table()”, and store it under the name “count_months”.
count_months <- data.frame(table(as.numeric(df$months)))
head(count_months)
##   Var1 Freq
## 1    3  242
## 2    4  261
## 3    5  281
## 4    6  273
## 5    7  239
## 6    8  260
To show the result we opted for a bar chart of the number of records in each month (the frequency). For plotting, we decided to use two libraries, “ggplot2” and “plotly”. The main graph is created with the function “ggplot()” and stored in a variable “p”. The interactive plot is then rendered by the function “ggplotly()”.
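The plotting code itself is not reproduced here. A minimal sketch of how such a plot could be built is given below; the choice of “geom_col()”, the fill colour and the axis labels are our assumptions, and “count_months_demo” is a small stand-in that reuses only the six rows printed above.

```r
library(ggplot2)
library(plotly)

# Stand-in for count_months, limited to the six rows shown by head() above.
count_months_demo <- data.frame(
  Var1 = factor(3:8),
  Freq = c(242, 261, 281, 273, 239, 260)
)

# Static plot: one bar per month, bar height equal to the number of records.
p <- ggplot(count_months_demo, aes(x = Var1, y = Freq)) +
  geom_col(fill = "steelblue") +
  labs(x = "Month", y = "Number of records")

# Interactive version of the same plot.
ggplotly(p)
```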
As the graph shows, the month with the highest frequency is October, while the minimum is registered in December. The December count is low because the last record is dated the 12th of December, so the month is incomplete. March is in a similar situation, since the first day on record is the 3rd (although only 2 days are missing). Our conclusion is that the month does not influence the frequency of earthquakes. We now want to analyse whether the month has some influence on the magnitude of an earthquake.
We first perform an ANOVA test to evaluate whether the month significantly affects the magnitude of an earthquake. In other words, we want to measure whether the intensity of an earthquake is influenced by temporal aspects.
fit <- aov(mag~months, data = df)
summary(fit)
##               Df Sum Sq Mean Sq F value Pr(>F)
## months         9     30   3.319   2.205 0.0192 *
## Residuals   2490   3749   1.506
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary shows that the ANOVA model is significant at a significance level of alpha = 0.05: the p-value of the test is 0.0192, with an F-statistic of 2.205. However, we believe that this result is due to random circumstances. Lowering the arbitrary significance level alpha (for example to 0.01) would be enough to reverse the conclusion. We therefore decide not to consider this result in our model. To support our view we draw some box plots.
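One way to see how weak the effect is, beyond the p-value, is to compute the share of the variance in magnitude that the month accounts for (eta squared). The sketch below is an illustration of ours, not part of the original analysis; it uses the rounded sums of squares printed in the ANOVA table, so the result is approximate.

```r
# Sums of squares as printed (rounded) in the ANOVA summary above.
ss_months    <- 30
ss_residuals <- 3749

# Eta squared: proportion of total variance attributed to the month.
eta_sq <- ss_months / (ss_months + ss_residuals)
round(eta_sq, 4)  # 0.0079
```

Under this estimate the month accounts for less than 1% of the variance in magnitude, which is consistent with our decision not to rely on the significant p-value.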
Here too we use the libraries “ggplot2” and “plotly”. We selected the box plot because it lets us show the distribution of the values for each month. In particular, the interactive plot makes it possible to inspect the median, quartiles and outliers.
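The box-plot code is not reproduced here either. The following is a minimal sketch of the approach, run on simulated data: “df_box_demo”, its three months and the mean and standard deviation passed to “rnorm()” are all assumptions made for illustration, not values taken from the USGS sample.

```r
library(ggplot2)
library(plotly)

# Simulated magnitudes (NOT real data): 50 values for each of three months.
set.seed(1)
df_box_demo <- data.frame(
  months = rep(c("03", "04", "05"), each = 50),
  mag    = rnorm(150, mean = 1.5, sd = 1.2)
)

# One box per month; hovering in the interactive plot shows median and quartiles.
p_box <- ggplot(df_box_demo, aes(x = months, y = mag)) +
  geom_boxplot() +
  labs(x = "Month", y = "Magnitude")

ggplotly(p_box)
```

With the real sample, “df” would take the place of “df_box_demo”, using the same “months” and “mag” columns passed to “aov()” above.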
After all these considerations, we conclude that month and magnitude are not causally related. Every seismic event depends on the movements of the Earth’s crust, and so does its magnitude. Moreover, we want to point out an aspect of the magnitude that will appear frequently in our data: the magnitude of an earthquake can be negative. These records belong to the phenomenon of micro-seismicity. Earthquakes of this kind are not felt by humans, but instruments can detect them. Because magnitude is measured on a logarithmic scale, an event weaker than the scale’s reference level has a magnitude below 0 on the Richter scale.
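The sign of the magnitude follows directly from the logarithm. As a simplified illustration, assuming the magnitude behaves like the base-10 logarithm of the ratio between the recorded amplitude and a reference amplitude (the variable “amplitude_ratio” and its values are hypothetical), any ratio below 1 gives a negative value:

```r
# Hypothetical ratios of recorded amplitude to the reference amplitude.
amplitude_ratio <- c(0.1, 1, 10)

# A ratio below 1 gives a negative value, a ratio of 1 gives 0.
log10(amplitude_ratio)  # -1 0 1
```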